MDEV-21423 - lock-free trx_sys get performance regression cause by lf_find and ut_delay#5043
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a new rw_trx_ids_t class to manage read-write transaction IDs and serialization numbers, moving them from the hash elements into a centralized vector within the transaction system. This involves significant updates to transaction registration, deregistration, and snapshotting logic across InnoDB. Feedback highlights a critical thread-safety issue where the ids vector is accessed without a read lock during potential reallocations, which could lead to memory corruption. Additionally, the use of memset to initialize a synchronization primitive was identified as unsafe and should be removed in favor of the standard initialization call.
0de23ce to
ac20be7
Compare
dr-m
left a comment
There was a problem hiding this comment.
This looks very promising. The reason why I won’t give an approval yet is that I didn’t review all details of this thoroughly, especially around startup and shutdown.
This needs to be tested, both for performance and stability. Please coordinate with the testers on this.
|
Test run completed on ac20be7 |
aaeeae2 to
6cf1240
Compare
dr-m
left a comment
There was a problem hiding this comment.
Since my previous review, the single std::vector has been split into two: rw_trx_vector::ids comprising the transaction start and end identifiers, and rw_trx_vector::trxs comprising the pointers to the transaction objects. I understood that this yielded the best performance, presumably because when iterating the identifiers, we would wasting cache line capacity on trx_t* and possible padding.
I only have a minor comment about documenting and enforcing the lifetime of trx_t::rw_trx_ids_slot.
I would request to rebase this to the 11.8 branch, because that is the branch where all InnoDB performance improvements done so far have landed. Also, the target customer for this improvement is moving to the MariaDB Enterprise Server 11.8 release.
| /** trx_sys.rw_trx_ids index, protected by trx_sys.rw_trx_ids.latch */ | ||
| uint32_t rw_trx_ids_slot; |
There was a problem hiding this comment.
When does this value change, and what is its lifetime? I see that we are not initializing this in trx_init() but directly in TrxFactory::init(). There’s no revision to assert_freed() or trx_t::free() either. Normally, we assert that all fields have been cleared, and we mark the memory inaccessible for AddressSanitizer.
I initially suspected that similar to max_inactive_id, this could be a "cache" that can remain valid across transaction restarts that happen to use the same memory buffer. But, it turns out that rw_trx_vector::register_trx() and rw_trx_vector::deregister_trx() guarantee that this field will be std::numeric_limits<uint32_t>::max() whenever a transaction is being allocated or freed.
There was a problem hiding this comment.
It gets valid value at trx_create() / trx_sys_t::register_trx() and the value remains until trx_t::free() / trx_sys_t::deregister_trx().
trx_t::assert_freed() is called from trx_t::commit() / trx_t::commit_cleanup(), which is transaction end. But rw_trx_ids_slot has to live until session end.
trx_init() is not suitable either, trx_create() is the beginning of the scope.
The sole purpose of std::numeric_limits<uint32_t>::max() was to assert that rw_trx_vector is not being accessed with uninitialized rw_trx_ids_slot. But it looks somewhat excessive and can be removed.
| ut_ad(state == TRX_STATE_NOT_STARTED); | ||
| ut_ad(!id); |
There was a problem hiding this comment.
Between these lines we are missing the following:
ut_ad(rw_trx_ids_slot == std::numeric_limits<uint32_t>::max());There was a problem hiding this comment.
trx_t::assert_freed() is called from trx_t::commit() / trx_t::commit_cleanup(), which is transaction end. But rw_trx_ids_slot has to live until session end.
…_find and ut_delay Under high concurrency, MVCC snapshot creation may spend a significant amount of time in lf_hash_iterate()/l_find() while collecting active read-write transaction identifiers. This overhead is particularly visible in sysbench oltp_read_write with transaction-isolation=READ-COMMITTED. Iteration cost becomes high due to significant TLB thrashing and poor memory locality in this hot code path because snapshot creation touches many rw_trx_hash nodes distributed across memory, including dummy nodes that are irrelevant for snapshot construction. In addition, traversing LF_HASH requires issuing heavyweight memory barriers. This is a performance regression after 53cc9aa, which changed MVCC snapshot creation to scan LF_HASH instead of maintaining a global sorted vector protected by the global mutex. Add trx_sys.rw_trx_ids, a compact traversal-friendly vector of active read-write transaction identifiers and serialization numbers optimized for MVCC snapshot creation, while rw_trx_hash remains responsible for transaction lookup. The vector may contain empty slots corresponding to idle or read-only transactions that currently do not own a read-write transaction identifier. Such slots are skipped by snapshot creation. This reduces traversal overhead during MVCC snapshot creation by improving memory locality, reducing TLB pressure, and avoiding repeated memory barriers required for rw_trx_hash traversal.
Under high concurrency, MVCC snapshot creation may spend a significant amount of time in lf_hash_iterate()/lfind() while collecting active read-write transaction identifiers. This overhead is particularly visible in sysbench oltp_read_write with transaction-isolation=READ-COMMITTED.
Iteration cost becomes high due to significant TLB thrashing and poor memory locality in this hot code path because snapshot creation touches many rw_trx_hash nodes distributed across memory, including dummy nodes that are irrelevant for snapshot construction. In addition, traversing LF_HASH requires issuing heavyweight memory barriers.
This is a performance regression after svoj@53cc9aa, which changed MVCC snapshot creation to scan LF_HASH instead of maintaining a global sorted vector protected by the global mutex.
Add trx_sys.rw_trx_ids, a compact traversal-friendly vector of active read-write transaction identifiers and serialization numbers optimized for MVCC snapshot creation, while rw_trx_hash remains responsible for transaction lookup.
The vector may contain empty slots corresponding to idle or read-only transactions that currently do not own a read-write transaction identifier. Such slots are skipped by snapshot creation.
This reduces traversal overhead during MVCC snapshot creation by improving memory locality, reducing TLB pressure, and avoiding repeated memory barriers required for rw_trx_hash traversal.